Skip to content

Use ReadOnlyMemory<byte> for binary data to eliminate UTF-16 transcoding#1070

Open
Copilot wants to merge 17 commits intomainfrom
copilot/fix-binary-data-encoding
Open

Use ReadOnlyMemory<byte> for binary data to eliminate UTF-16 transcoding#1070
Copilot wants to merge 17 commits intomainfrom
copilot/fix-binary-data-encoding

Conversation

Copy link
Contributor

Copilot AI commented Dec 5, 2025

Plan: Fix Binary Data Representation in Protocol Types

Based on the issue analysis, I need to change how binary data is represented in the protocol types to avoid unnecessary UTF-16 transcoding.

Changes Completed:

  • Update BlobResourceContents property

    • Changed Blob from string to ReadOnlyMemory<byte> for base64-encoded UTF-8 bytes
    • Renamed Data to DecodedData for consistency with ContentBlock types
    • Added lazy decoding with caching using Base64.DecodeFromUtf8 (no string intermediate)
    • Added cache invalidation when Blob is set
    • Added Debug.Assert for encode length validation
    • Added FromData factory method for efficient initialization
    • Removed "zero-copy" claims from documentation
  • Update ImageContentBlock.Data property

    • Changed from string to ReadOnlyMemory<byte>
    • Added lazy DecodedData property with Base64.DecodeFromUtf8 (no string intermediate)
    • Added cache invalidation when Data is set
    • Renamed FromImage to FromBytes
    • Removed "zero-copy" claims from documentation
  • Update AudioContentBlock.Data property

    • Changed from string to ReadOnlyMemory<byte>
    • Added lazy DecodedData property with Base64.DecodeFromUtf8 (no string intermediate)
    • Added cache invalidation when Data is set
    • Renamed FromAudio to FromBytes
    • Removed "zero-copy" claims from documentation
  • Created Base64Helpers class

    • Dedupl icated encode/decode logic into Base64Helpers.EncodeToBase64Utf8 and DecodeFromBase64Utf8
    • Includes conditional compilation for NET vs netstandard2.0
    • All three types (BlobResourceContents, ImageContentBlock, AudioContentBlock) use these helpers
  • Fixed AIContentExtensions.cs

    • Updated to use DecodedData property
    • Removed unsafe code from GetUtf8Bytes helper
    • Use span-based Encoding methods with polyfills for netstandard2.0
  • Fixed sample files

    • Moved System.Text to using statements per code review feedback
    • Simplified fully-qualified names
  • Created polyfills

    • Added EncodingExtensions polyfill for span-based GetByteCount and GetBytes methods
    • Enabled AllowUnsafeBlocks in ModelContextProtocol.csproj and tests/Directory.Build.props
  • Fixed AIFunctionMcpServerResource.cs

    • Created GetBase64Utf8Bytes helper to deduplicate DataContent Base64Data extraction
    • Used helper in both switch cases
  • Updated all tests

    • Replaced FromImage with FromBytes
    • All tests passing
  • Build and test

    • Build succeeds across all target frameworks
    • All protocol and binary data tests passing

Summary:

Successfully converted binary data properties from string to ReadOnlyMemory<byte> with complete elimination of UTF-16 transcoding:

  1. Wire format: Base64-encoded UTF-8 bytes stored directly from JSON
  2. Decoding: Direct UTF-8 → binary via Base64.DecodeFromUtf8 (no string intermediate on .NET)
  3. Caching: Lazy decode with result caching
  4. Initialization: Factory methods (FromData, FromBytes) for efficient creation from binary data
  5. Consistency: DecodedData naming across all binary types, FromBytes naming for factory methods
  6. Code reuse: Base64Helpers class deduplicates encode/decode logic
  7. Helper methods: GetBase64Utf8Bytes deduplicates DataContent extraction
Original prompt

This section details on the original issue you should resolve

<issue_title>Content types that represent binary data are exposing UTF16 encoded strings</issue_title>
<issue_description>

/// <summary>
/// Gets or sets the base64-encoded string representing the binary data of the item.
/// </summary>
[JsonPropertyName("blob")]
public required string Blob { get; set; }

/// <summary>
/// Gets or sets the base64-encoded image data.
/// </summary>
[JsonPropertyName("data")]
public required string Data { get; set; }

/// <summary>
/// Gets or sets the base64-encoded audio data.
/// </summary>
[JsonPropertyName("data")]
public required string Data { get; set; }

Potentially more -- https://github.com/search?q=repo%3Amodelcontextprotocol%2Fcsharp-sdk%20%22base64-encoded%22&type=code

Doing this means that the UTF8 data that comes across the wire is transcoded to UTF16 just to be then base64 decoded to bytes when anyone uses it. The intermediate step to UTF16 (.NET string) is unnecessary, wastes cycles, creates garbage, and makes the API harder to use.

Instead these types should keep the data in UTF8 and then have lazy getters that will do the base64-decoding. We can tell folks to use those decoded getter properties.

https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.AI.Abstractions/Contents/DataContent.cs follows a similar pattern, though it uses the Uri as it's serialized representation. It seems these MCP types should use the UTF8 encoded data as their serialized representation.

The OpenAI SDK exposes a BinaryData as its representation to deal with this problem https://github.com/openai/openai-dotnet/blob/636075222205d626a770acb1af1e5cc060abd517/src/Custom/Images/GeneratedImage.cs#L23 however that's slightly less usable since it forces the caller to decode.

The proposal here would be to have these be instead:

/// <summary> 
/// Gets or sets the base64-encoded UTF-8 byes representing the binary data of the item.
/// </summary> 
/// <remarks>
/// This is a zero-copy representation of the wire payload of this item.  Setting this value will invalidate any cached value of <see cref="Data" />.
/// </remarks>
[JsonPropertyName("blob")] 
public required ReadOnlyMemory<byte> Blob { get; set; } 

/// <summary> 
/// Gets the decoded data represented by <see cref="Blob" />.
/// </summary>
/// <remarks>
/// Accessing this member will decode the value in <see cref="Blob" /> and cache the result.
/// </remarks>
[JsonIgnore]
public byte[] Data { get; }

</issue_description>

<agent_instructions>Please find any protocol types that expose binary data as string (typically base64 encoded) and follow the cited recommendations for how to expose them differently.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@stephentoub > The proposal here would be to have these be instead:

Should Data also be a ReadOnlyMemory<byte>?

Is Data going to cache the decoded data lazily, or is it going to produce a new array on each access? If the latter, it shouldn't be a property.

what do you think about the name Blob

I'm torn. On the one hand, I agree with you. On the other hand, it's nice to keep it the same or super close to the spec. Offline you'd proposed something like Base64Blob; is that still your first choice?</comment_new>


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI and others added 3 commits December 5, 2025 23:00
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix binary data types exposing UTF16 encoded strings Use ReadOnlyMemory<byte> for binary data to eliminate UTF-16 transcoding Dec 5, 2025
Copilot AI requested a review from stephentoub December 5, 2025 23:28
@stephentoub stephentoub requested a review from ericstj December 8, 2025 03:42
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
@stephentoub
Copy link
Contributor

@ericstj, I'm not sure this will actually end up saving much end-to-end. It'd help if some of the data wasn't actually looked at, but if it ends up being looked at, it seems it'll actually make the end-to-end worse.

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
@ericstj
Copy link
Collaborator

ericstj commented Dec 8, 2025

I see this PR is still going though a string when decoding - I thought we had API that would go straight from ROS (base64 UTF8) to byte[] without an intermediate string.

@stephentoub
Copy link
Contributor

I thought we had API that would go straight from ROS (base64 UTF8) to byte[] without an intermediate string.

We have operation status APIs that write into an existing destinatiom buffer, but no one-shot that creates the array. We do for Base64Url but we didn't square the circle for Base64.

@ericstj
Copy link
Collaborator

ericstj commented Dec 8, 2025

Can't we do Base64.GetMaxDecodedFromUtf8Length to get the size, create a byte array, then Base64.DecodeFromUtf8, followed by slice to final actual size?

@stephentoub
Copy link
Contributor

Base64.DecodeFromUtf8

That's the operation status API I was referring to.

Can't we do Base64.GetMaxDecodedFromUtf8Length to get the size, create a byte array, then Base64.DecodeFromUtf8, followed by slice to final actual size?

I thought you'd stated you wanted the decoded data property to be a byte[]. If it is, slicing would entail allocating a whole new array unless the max happened to equal actual, which would happen only ~33% of the time.

If the destination property is a ReadOnlyMemory<byte>, then yes, it can be done that way.

@ericstj
Copy link
Collaborator

ericstj commented Dec 9, 2025

If the destination property is a ReadOnlyMemory<byte>, then yes, it can be done that way.

This is exactly what I was thinking of when mentioning:

If using ROM avoids a copy it's acceptable.

And the current implementation is already exposing as public ReadOnlyMemory<byte> DecodedData so I think we've already committed to this. Let's go ahead and remove those string intermediates. @copilot

Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Dec 9, 2025

If the destination property is a ReadOnlyMemory<byte>, then yes, it can be done that way.

This is exactly what I was thinking of when mentioning:
...

Implemented using Base64.DecodeFromUtf8 to decode directly from UTF-8 bytes without string intermediate in commit 39213fb. The implementation:

  • Uses Base64.GetMaxDecodedFromUtf8Length to allocate the buffer
  • Calls Base64.DecodeFromUtf8 to decode directly
  • Slices to actual size using ReadOnlyMemory when needed (only allocates new array ~33% of time when actual != max)
  • Falls back to optimized string-based approach for .NET Standard 2.0

@stephentoub
Copy link
Contributor

@ericstj, is this something you're still interested in pushing forward?

@ericstj
Copy link
Collaborator

ericstj commented Jan 21, 2026

Yes, thanks for the reminder. Let me refresh my memory on it and have a look this week.

@ericstj
Copy link
Collaborator

ericstj commented Jan 27, 2026

Fixing up the merge conflict.

@ericstj
Copy link
Collaborator

ericstj commented Jan 30, 2026

Test failure is #1222

@ericstj ericstj marked this pull request as ready for review February 2, 2026 18:31
ericstj
ericstj previously approved these changes Feb 5, 2026
@ericstj
Copy link
Collaborator

ericstj commented Feb 5, 2026

@stephentoub when you have a chance would appreciate your pass on this once more, since I'm the last committer.

new BlobResourceContents
{
Blob = resource.Description!,
Blob = System.Text.Encoding.UTF8.GetBytes(resource.Description!),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot, we prefer to use simple rather than fully-qualified names. Can these "System.Text"s be moved instead to usings at the beginning?

Comment on lines 454 to 465
// gets UTF-8 bytes from UTF-16 chars without intermediate string allocations
fixed (char* pChars = utf16)
{
var byteCount = System.Text.Encoding.UTF8.GetByteCount(pChars, utf16.Length);
var bytes = new byte[byteCount];

fixed (byte* pBytes = bytes)
{
System.Text.Encoding.UTF8.GetBytes(pChars, utf16.Length, pBytes, byteCount);
}
return bytes;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot, this shouldn't require unsafe code on .NET Core... there are overloads that work with spans, so this can be e.g.

byte[] bytes = new byte[Encoding.UTF8.GetByteCount(utf16)]
Encoding.UTF8.GetBytes(utf16, bytes);
return bytes;

For the netstandard2.0 build, you can add polyfills for these methods in https://github.com/modelcontextprotocol/csharp-sdk/tree/main/src/Common/Polyfills/System

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had used the unsafe code to avoid the ifdef, I agree a polyfill approach should work. Thanks for suggestion.

throw new FormatException("Invalid base64 data");
}
}
return _decodedData.Value;
Copy link
Contributor

@stephentoub stephentoub Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot

Suggested change
return _decodedData.Value;
return _decodedData.Value;

/// </para>
/// </remarks>
[JsonIgnore]
public ReadOnlyMemory<byte> Data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericstj why do you call this Data here but DecodedData elsewhere like on ContentBlock?

Copy link
Collaborator

@ericstj ericstj Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can change it.

The change was made here 4287694#diff-9da29ad32a17c90e9b28c79ad3ab9c4b370a757e26657c736020ea47314a60ba

I had assumed we would have liked to use Data everywhere, since that's the better name and more-likely the "useful" property for the caller. The ContentBlock types had already used Data to mean the encoded data, so we didn't change that.

I agree that consistency is better here - lets use DecodedData here. What are your thoughts on Blob vs Data for the encoded data? Should we keep those different between content and resource types (since it's preexisting discrepancy) or try to make the consistent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep those different between content and resource types (since it's preexisting discrepancy) or try to make the consistent?

I think there's value in keeping the names of these C# properties aligned with their spec / JSON schema counterparts. I don't know why they differ in the MCP spec, but alas.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we'll keep them different, but rename Data to DecodedData here.

@ericstj
Copy link
Collaborator

ericstj commented Feb 6, 2026

@copilot please address feedback

…sed encoding, remove System.Text qualifiers

Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Feb 6, 2026

@copilot please address feedback

Addressed all feedback in commit 0a3015f:

  • Renamed Data to DecodedData in BlobResourceContents for consistency
  • Removed unsafe code from AIContentExtensions, using span-based Encoding methods
  • Created EncodingExtensions polyfill for netstandard2.0 support
  • Moved System.Text to using statements in sample files
  • Added Debug.Assert for bytesWritten validation
  • Return _decodedData.Value directly as suggested
  • All binary data tests passing

Copilot AI requested a review from ericstj February 6, 2026 16:23
/// the Model Context Protocol. The binary data is represented as a base64-encoded string
/// in the <see cref="Blob"/> property.
/// the Model Context Protocol. The binary data is represented as base64-encoded UTF-8 bytes
/// in the <see cref="Blob"/> property, providing a zero-copy representation of the wire payload.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still copied. It just doesn't have its encoding changed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, we don't control the buffering during serialization or to network payload. This type itself won't copy when initialized with pre-encoded bytes. Hypothetically something using the type could slot these bytes into a a sequence of ROM for a zero-copy solution.

I agree that this is not really a useful doc change though. This is just a case of copilot making the rookie mistake of commenting on the diff instead of the result.

We should remove all mention of zero-copy from this change.

byte[] buffer = new byte[maxLength];
if (Base64.EncodeToUtf8(data.Span, buffer, out _, out int bytesWritten) == OperationStatus.Done)
{
Debug.Assert(bytesWritten == buffer.Length, "Base64 encoding should always produce exact length for non-padded input");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Debug.Assert(bytesWritten == buffer.Length, "Base64 encoding should always produce exact length for non-padded input");
Debug.Assert(bytesWritten == buffer.Length, "Base64 encoding should always produce the same length as the max length");

/// Gets or sets the base64-encoded UTF-8 bytes representing the binary data of the item.
/// </summary>
/// <remarks>
/// This is a zero-copy representation of the wire payload of this item. Setting this value will invalidate any cached value of <see cref="DecodedData"/>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "zero-copy representation of the wire payload" part of this is confusing.

throw new FormatException("Invalid base64 data");
}
}
return _decodedData.Value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return _decodedData.Value;
return _decodedData.Value;

/// <param name="mimeType">The MIME type of the audio.</param>
/// <returns>A new <see cref="AudioContentBlock"/> instance.</returns>
/// <exception cref="InvalidOperationException"></exception>
public static AudioContentBlock FromAudio(ReadOnlyMemory<byte> audioData, string mimeType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the FromImage and FromAudio naming. These are already _Image_ContentBlock and _Audio_ContentBlock. Is the word in the name duplicative? Should these be FromBytes or FromRawData or something like that? If we can't come up with anything better, we can go with the current names, it's just a little strange to me, both with the duplication and with this saying it's from audio/images but the arguments are arbitrary bytes/mimeType, and there's no validation that it's actually audio/image.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with FromBytes. I was thinking FromData didn't work because of existing Data properties. Let's rename to FromBytes and mention in comments that it represents decoded bytes and using these methods will retain a reference of the provided bytes as the DecodedData and will encode them to Data.

Comment on lines 503 to 513
// Encode directly to UTF-8 base64 bytes without string intermediate
int maxLength = Base64.GetMaxEncodedToUtf8Length(audioData.Length);
byte[] buffer = new byte[maxLength];
if (Base64.EncodeToUtf8(audioData.Span, buffer, out _, out int bytesWritten) == OperationStatus.Done)
{
data = buffer.AsMemory(0, bytesWritten);
}
else
{
throw new InvalidOperationException("Failed to encode binary data to base64");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this exact code duplicated? Can we dedup into a helper function? Same for the other direction.

throw new FormatException("Invalid base64 data");
}
}
return _decodedData.Value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return _decodedData.Value;
return _decodedData.Value;

Comment on lines 399 to 401
Blob = MemoryMarshal.TryGetArray(dc.Base64Data, out ArraySegment<char> segment)
? Encoding.UTF8.GetBytes(segment.Array!, segment.Offset, segment.Count)
: Encoding.UTF8.GetBytes(dc.Base64Data.ToString())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the below can also be deduped into a helper function for getting an array from a DataContent

@ericstj
Copy link
Collaborator

ericstj commented Feb 7, 2026

@copilot please address feedback, and be sure to consider variants of each suggestion.

…dio to FromBytes, add Base64 helpers, fix Encoding references

Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Feb 7, 2026

@copilot please address feedback, and be sure to consider variants of each suggestion.

Addressed all feedback in commit e4f2816:

  • Removed all "zero-copy" claims from documentation
  • Updated Debug.Assert message as suggested
  • Renamed FromImage/FromAudio to FromBytes with updated documentation
  • Created Base64Helpers class to deduplicate encode/decode logic (used by BlobResourceContents, ImageContentBlock, AudioContentBlock)
  • Created GetBase64Utf8Bytes helper to deduplicate DataContent Base64Data extraction
  • Fixed System.Text.Encoding references in Base64Helpers for netstandard2.0
  • Updated all tests to use FromBytes
  • All tests passing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Content types that represent binary data are exposing UTF16 encoded strings

5 participants